{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "21cfea06-af98-496a-b13b-106c335a2e65",
   "metadata": {},
   "source": [
    "# Optional schema fields\n",
    "\n",
    "`SchemaField`s can be declared optional, allowing the user to ingest records where that particular field is missing. Non-optional `SchemaField`s will raise at ingestion of the data is missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1bf67328-efe5-4c88-9c36-ce2d4b20d89f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install superlinked==37.5.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "11664035-fff3-4c38-97f3-f2fbb0d46778",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from superlinked import framework as sl\n",
    "\n",
    "pd.set_option(\"display.max_colwidth\", 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e74b7c5a-8146-49bd-9bf1-16a904d8b1a6",
   "metadata": {},
   "source": [
    "## Set up optional fields"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2df24eaf-f9b8-404b-8b7f-0a9d2c2284df",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Paragraph(sl.Schema):\n",
    "    id: sl.IdField\n",
    "    body: sl.String\n",
    "    like_count: sl.Integer | None  # configuring an optional SchemaField"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92b7fc35-f4b0-4f7f-85d5-b706e19c408c",
   "metadata": {},
   "source": [
    "This way one can ingest records where `like_count` is missing.\n",
    "\n",
    "Now let's set up a basic config to see it working."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ba5aad5c-f29b-4eb4-8b5d-c7e558c6f12d",
   "metadata": {},
   "outputs": [],
   "source": [
    "paragraph = Paragraph()\n",
    "\n",
    "body_space = sl.TextSimilaritySpace(text=paragraph.body, model=\"sentence-transformers/all-mpnet-base-v2\")\n",
    "like_space = sl.NumberSpace(number=paragraph.like_count, min_value=0, max_value=100, mode=sl.Mode.MAXIMUM)\n",
    "\n",
    "paragraph_index = sl.Index([body_space, like_space])\n",
    "\n",
    "source: sl.InMemorySource = sl.InMemorySource(paragraph)\n",
    "executor = sl.InMemoryExecutor(sources=[source], indices=[paragraph_index])\n",
    "app = executor.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2947599-3560-4e82-b382-503c33f0cda3",
   "metadata": {},
   "source": [
    "## Ingesting records with missing data\n",
    "\n",
    "`like_count` containing None, or the record simply missing the key results in the system handling the field as missing data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "61d76e31-72f4-4404-8cc1-31c08d08d978",
   "metadata": {},
   "outputs": [],
   "source": [
    "source.put(\n",
    "    [\n",
    "        {\n",
    "            \"id\": \"paragraph-1\",\n",
    "            \"body\": \"Glorious animals live in the wilderness.\",\n",
    "            \"like_count\": 10,\n",
    "        },\n",
    "        {\n",
    "            \"id\": \"paragraph-1-missing-key\",\n",
    "            \"body\": \"Glorious animals live in the wilderness.\",\n",
    "        },\n",
    "        {\n",
    "            \"id\": \"paragraph-2\",\n",
    "            \"body\": \"Growing computation power enables advancements in AI.\",\n",
    "            \"like_count\": 100,\n",
    "        },\n",
    "        {\n",
    "            \"id\": \"paragraph-2-missing-None\",\n",
    "            \"body\": \"Growing computation power enables advancements in AI.\",\n",
    "            \"like_count\": None,\n",
    "        },\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebf4511f-a4c9-48f8-8a28-ca01b5b63966",
   "metadata": {},
   "source": [
    "But `body`, which is not configured as optional is going to raise for any of these inputs that constitute a missing value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a411d4eb-a7be-4804-8dd3-9bf0f343d93b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(\"The SchemaField Paragraph.body doesn't have a default value and was not provided in the ParsedSchema.\",)\n"
     ]
    }
   ],
   "source": [
    "try:\n",
    "    source.put(\n",
    "        [\n",
    "            {\n",
    "                \"id\": \"paragraph-x\",\n",
    "                \"body\": None,\n",
    "                \"like_count\": 10,\n",
    "            }\n",
    "        ]\n",
    "    )\n",
    "except Exception as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "429b166b-79d9-4f4a-8b6b-a99555dd9c3e",
   "metadata": {},
   "source": [
    "## Querying items with missing values\n",
    "\n",
    "Missing values do not influence query results, they effectively produce zero scores in terms of that particular attribute. Let's showcase that with a query!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "198f0bce-5d7c-412b-a989-d0ab60982770",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = (\n",
    "    sl.Query(paragraph_index, weights={body_space: sl.Param(\"body_weight\"), like_space: sl.Param(\"like_weight\")})\n",
    "    .find(paragraph)\n",
    "    .similar(body_space, sl.Param(\"query_text\"))\n",
    "    .select_all()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "aaa75750-3b1c-4d68-af9c-133a249f6b0b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>body</th>\n",
       "      <th>like_count</th>\n",
       "      <th>id</th>\n",
       "      <th>similarity_score</th>\n",
       "      <th>rank</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Growing computation power enables advancements in AI.</td>\n",
       "      <td>100.0</td>\n",
       "      <td>paragraph-2</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Growing computation power enables advancements in AI.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>paragraph-2-missing-None</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Glorious animals live in the wilderness.</td>\n",
       "      <td>10.0</td>\n",
       "      <td>paragraph-1</td>\n",
       "      <td>0.096102</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Glorious animals live in the wilderness.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>paragraph-1-missing-key</td>\n",
       "      <td>0.017885</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    body  like_count  \\\n",
       "0  Growing computation power enables advancements in AI.       100.0   \n",
       "1  Growing computation power enables advancements in AI.         NaN   \n",
       "2               Glorious animals live in the wilderness.        10.0   \n",
       "3               Glorious animals live in the wilderness.         NaN   \n",
       "\n",
       "                         id  similarity_score  rank  \n",
       "0               paragraph-2          1.000000     0  \n",
       "1  paragraph-2-missing-None          0.500000     1  \n",
       "2               paragraph-1          0.096102     2  \n",
       "3   paragraph-1-missing-key          0.017885     3  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = app.query(\n",
    "    query, query_text=\"Growing computation power enables advancements in AI.\", body_weight=1.0, like_weight=1.0\n",
    ")\n",
    "\n",
    "sl.PandasConverter.to_pandas(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5240182-09ce-4258-86f3-d5f108356327",
   "metadata": {},
   "source": [
    "We can easily observe that by searching with the exact same text, the maximum like count `paragraph-2` has perfect score, while the paragraph with the same text but missing like count (`paragraph-2-missing-None`) has exactly half score."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "superlinked-py3.11",
   "language": "python",
   "name": "superlinked-py3.11"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
